This is an R Markdown document. Markdown is a simple formatting
syntax for authoring HTML, PDF, and MS Word documents. For more details
on using R Markdown see https://rmarkdown.rstudio.com/lesson-1.html. In the
header section above, you can configure options
for this document, including title, author(s), and additional style
and output options. The nocite and
bibliography lines automatically add a bibliography for the
software packages you have used. Remove the nocite line to
suppress references you haven’t cited. You may delete this instruction
section.
This study is a reproduction of:
Chakraborty, J. 2021. Social inequities in the distribution of COVID-19: An intra-categorical analysis of people with disabilities in the U.S. Disability and Health Journal 14:1-5. https://doi.org/10.1016/j.dhjo.2020.101007
The original study was published urgently during the initial phases of the COVID-19 pandemic. This reproduction seeks to amend procedures in a reproducible, open source format. Given this context, the reproduction study seeks to determine similar methods to make the study reproducible and to identify possible improvements to the methods. By reproducing Chakraborty’s study, this reproduction could hopefully expand access to and stimulate dialogue about the study of social vulnerability as it relates to epidemiology. As contagious disease becomes increasingly prevalent, these types of study methods are critical to refine and share.
Key words: COVID-19, Disability, social vulnerability,
cluster-analysis, reproducibilitySubject: select from the BePress
TaxonomyDate created: 03-10-2025Date modified: 05-20-2025Spatial Coverage: Continental United States (lower 48
and Washington DC)Spatial Resolution: US CountySpatial Reference System: Contiguous USA Albers Equal
Area projection EPSG:5070Temporal Coverage: From 1/22/2020 (when John Hopkins
began collecting the data) to 8/1/2020 (when the data was retrieved for
the original study)Temporal Resolution: rates were calculated off of the
whole study period (1 temporal unit)Spatial Coverage: Continental United States (lower 48
and Washington DC)Spatial Resolution: US CountySpatial Reference System: Contiguous USA Albers Equal
Area projection EPSG:5070Temporal Coverage: From 1/22/2020 (when John Hopkins
began collecting the data) to 8/1/2020 (when the data was retrieved for
the original study)Temporal Resolution: rates were calculated off of the
whole study period (1 temporal unit)This is a reproduction study for Chakraborty’s 2021
study “Social inequities in the distribution of COVID-19: An
intra-categorical analysis of people with disabilities in the U.S.” The
reproduction seeks to use the SpatialEpi package to
recalculate the likelihood clusters from the original study using
reproducible, open source software. Given differences between the
SaTScan software used in the original study and Kuldorff Spatial Scan
from the SpatialEpi package methods, it is hypothesized
that Kuldorff Spatial Scan will be better able to identify clusters
around urban corridors and large metropolitan regions with multiple
urban centers.
Describe the data sources and variables to be used. Data sources may include plans for observing and recording primary data or descriptions of secondary data. For secondary data sources with numerous variables, the analysis plan authors may focus on documenting only the variables intended for use in the study.
Primary data sources for the study are to include … . Secondary data sources for the study are to include … .
Each of the next subsections describes one data source.
Title: American Community Survey (ACS) five-year
estimate (2014-2018)Abstract: Demographic breakdown of disabled
populationSpatial Coverage: USASpatial Resolution: CountySpatial Representation Type: vector multipolygonSpatial Reference System: EPSG 4269Temporal Coverage: 2014-2018Temporal Resolution: 5-year estimateLineage: Retrieved values from Census website tables
S1810 and C18130 tables using tidyCensusDistribution: Publicly availableConstraints: Public DataThe American Community Survey (ACS) five-year estimate (2014-2018) variables used in the study are outlined in the table below. Details on ACS data collection can be found at https://www.census.gov/topics/health/disability/guidance/data-collection-acs.html and details on sampling methods and accuracy can be found at https://www.census.gov/programs-surveys/acs/technical-documentation/code-lists.html.
| Variable Name in Study | ACS Variable name |
|---|---|
| percent of total civilian non-institutionalized population with a disability | S1810_C03_001E |
| Race | |
| percent w disability: White alone | S1810_C03_004E |
| percent w disability: Black alone | S1810_C03_005E |
| percent w disability: Native American | S1810_C03_006E |
| percent w disability: Asian alone | S1810_C03_007E |
| percent w disability: Other race | S1810_C03_009E |
| Ethnicity | |
| percent w disability: Non-Hispanic White | S1810_C03_0011E |
| percent w disability: Hispanic | S1810_C03_012E |
| percent w disability: Non-Hispanic non-White | (S1810_C02_001E - S1810_C02_011E - S1810_C02_012E) / (S1810_C01_001E - S1810_C01_011E - S1810_C01_012E) * 100 |
| percent w disability: Other race | S1810_C03_009E |
| Poverty | |
| percent w disability: Below poverty level | (C18130_004E + C18130_011E + C18130_018E) / C18130_001E * 100 |
| percent w disability: Above poverty level | (C18130_005E + C18130_012E + C18130_019E) / C18130_001E * 100 |
| Age | |
| percent w disability: 5-17 | S1810_C03_014E |
| percent w disability: 18-34 | S1810_C03_015E |
| percent w disability: 35-64 | S1810_C03_016E |
| percent w disability: 65-74 | S1810_C03_017E |
| percent w disability: 75+ | S1810_C03_018E |
| Biological sex | |
| percent w disability: male | S1810_C03_001E |
| percent w disability: female | S1810_C03_003E |
American Community Survey (ACS) data for sociodemographic
subcategories of people with disabilities can be accessed by using the
tidycensus package to query the Census API. This requires
an API key which can be acquired at api.census.gov/data/key_signup.html.
## [conflicted] Will prefer dplyr::filter over any other package.
Data on COVID-19 cases from the Johns Hopkins University dashboard
have been provided directly with the research compendium because the
data is no longer available online in the state in which it was
downloaded on August 1, 2020. The dashboard and cumulative counts of
COVID-19 cases and deaths were continually updated, so an exact
reproduction required communication with the original author, Jayajit
Chakraborty, for assistance with provision of data from August 1, 2020.
The data includes an estimate of the total population
(POP_ESTIMA) and confirmed COVID-19 cases
(Confirmed). The COVID-19 case data expresses cumulative
count of reported COVID-19 from 1/22/2020 to 8/1/2020. Although metadata
for this particular resource is no longer available from the original
source, one can reasonably assume that the total population estimate was
based on the 2014-2018 5-year ACS estimate, as the 2019 estimates data
had not been released yet.
Versions of the data can be found at the John Hopkins CCSE COVID-19 Data Repository (https://github.com/CSSEGISandData/COVID-19). However, archived data only provides summaries at the national scale. We received the COVID-19 case data through 8/1/2020 at the county level from the author, as there is no readily apparent way to access archived data from the Johns Hopkins University Center for Systems Science Engineering database.
At the time of this study pre-registration, the authors had no prior knowledge of the geography of the study region with regards to the COVID-19 and social vulnerability phenomena to be studied. This study is related to no prior studies by the authors
The authors have not engaged with any of the data sources prior to the study.
Method used to factor in spatial autocorrelation may be flawed due to the SaTScan algorithm which generates clusters hierarchically beginning with the cluster with the largest relative risk (RR). By calculating the whole cluster first, it eliminates possible nearby clusters that overlap with the range of the previous cluster. Therefore, a Kuldorff spatial scan will be used to calculate RR likelihood.
The original study extent is the lower 48 states and Washington D.C. Therefore, Alaska, Hawai’i and Puerto Rico are removed from the data (workflow step 1). Data on people with disabilities in poverty is derived from a different census table (C18130) than data on people with disabilities and age, race, ethnicity, age, and biological sex (S1810). Therefore, join the poverty data to the other data using the GEOID (workflow step 3). Also transform the ACS geographic data into Contiguous USA Albers Equal Area projection and fix geometry errors.
Optionally, save the raw ACS data to
data/raw/public/acs.gpkg for use in GIS software.
Calculate independent socio-demographic variables of people with disabilities as percentages for each sub-category of disability (race, ethnicity, poverty, age, and biological sex) and remove raw census data from the data frame (workflow step 4). Reproject the data into an Albers equal area conic projection.
Calculate the COVID incidence rate as the cases per 100,000 people (workflow step 2). Convert the COVID data to a non-geographic data frame.
Join dependent COVID data to independent ACS demographic data.
Unplanned deviation for reproduction: There is one county with missing disability and poverty data. This was not mentioned in the original study or in our pre-analyis plan. However, we replace the missing data with zeros, producing results identical to Chakraborty’s.
| fips | statefp | county | county_st | covid_rate | dis_pct | white_pct | black_pct | native_pct | asian_pct | other_pct | non_hisp_white_pct | hisp_pct | non_hisp_non_white_pct | bpov_pct | apov_pct | pct_5_17 | pct_18_34 | pct_35_64 | pct_65_74 | pct_75 | male_pct | female_pct | pop | cases | x | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35039 | 35 | Rio Arriba | Rio Arriba County, New Mexico | 751.17 | 16.06467 | 10.77458 | 0.038371 | 2.744807 | 0.038371 | 2.468536 | 2.355981 | 11.39619 | 2.312494 | NA | NA | 0.3069682 | 1.258569 | 6.781439 | 3.391998 | 4.279648 | 8.556738 | 7.50793 | 39006 | 293 | -106.6932 | 36.50962 |
Map the county level distribution of COVID-19 incidence rates, comparing to Figure 1 of the original study.
Unplanned deviation for reproduction: We also map the spatial distribution of the percent of people with any disability to improve our understanding of the geographic patterns and relationships of between the overarching independent variable (percentage of people with disability) and the dependent variable (COVID-19 incidence rate).
Calculate descriptive statistics for dependent COVID-19 rate and independent socio-demographic characteristics, reproducing the min, max, mean, and SD columns of original study table 1.
Planned deviation for reanalysis: We also calculate the Shapiro Wilk test for normality.
| min | max | mean | SD | ShapiroWilk | p | |
|---|---|---|---|---|---|---|
| covid_rate | 0.00 | 14257.17 | 966.90 | 1003.96 | 0.74 | 0 |
| dis_pct | 3.83 | 33.71 | 15.95 | 4.40 | 0.98 | 0 |
| white_pct | 0.85 | 33.26 | 13.55 | 4.63 | 0.98 | 0 |
| black_pct | 0.00 | 20.70 | 1.48 | 2.66 | 0.61 | 0 |
| native_pct | 0.00 | 13.74 | 0.28 | 0.94 | 0.28 | 0 |
| asian_pct | 0.00 | 3.45 | 0.09 | 0.18 | 0.51 | 0 |
| other_pct | 0.00 | 15.24 | 0.55 | 0.65 | 0.57 | 0 |
| non_hisp_white_pct | 0.10 | 33.16 | 12.84 | 4.81 | 0.99 | 0 |
| hisp_pct | 0.00 | 25.26 | 0.99 | 2.15 | 0.42 | 0 |
| non_hisp_non_white_pct | 0.00 | 20.93 | 2.13 | 2.75 | 0.70 | 0 |
| bpov_pct | 0.00 | 14.97 | 3.57 | 1.85 | 0.93 | 0 |
| apov_pct | 0.00 | 27.30 | 12.48 | 3.06 | 0.99 | 0 |
| pct_5_17 | 0.00 | 5.08 | 1.03 | 0.48 | 0.95 | 0 |
| pct_18_34 | 0.00 | 5.59 | 1.56 | 0.67 | 0.96 | 0 |
| pct_35_64 | 1.01 | 18.36 | 6.35 | 2.30 | 0.96 | 0 |
| pct_65_74 | 0.00 | 12.73 | 3.09 | 1.16 | 0.95 | 0 |
| pct_75 | 0.00 | 11.13 | 3.87 | 1.19 | 0.97 | 0 |
| male_pct | 1.30 | 18.19 | 8.06 | 2.37 | 0.98 | 0 |
| female_pct | 1.91 | 19.94 | 7.90 | 2.26 | 0.98 | 0 |
Compare reproduced descriptive statistics to original descriptive statistics. Difference is calculated as ‘reproduction study - original study’. Identical results will result in zero.
| min | max | mean | SD | |
|---|---|---|---|---|
| covid_rate | 0 | 0.17 | -0.1 | -0.04 |
| dis_pct | 0 | 0.00 | 0.0 | 0.00 |
| white_pct | 0 | 0.00 | 0.0 | 0.00 |
| black_pct | 0 | 0.00 | 0.0 | 0.00 |
| native_pct | 0 | 0.00 | 0.0 | 0.00 |
| asian_pct | 0 | 0.00 | 0.0 | 0.00 |
| other_pct | 0 | 0.00 | 0.0 | 0.00 |
| non_hisp_white_pct | 0 | 0.00 | 0.0 | 0.00 |
| hisp_pct | 0 | 0.00 | 0.0 | 0.00 |
| non_hisp_non_white_pct | 0 | 0.00 | 0.0 | 0.00 |
| bpov_pct | 0 | 0.00 | 0.0 | 0.00 |
| apov_pct | 0 | 0.00 | 0.0 | 0.00 |
| pct_5_17 | 0 | 0.00 | 0.0 | 0.00 |
| pct_18_34 | 0 | 0.00 | 0.0 | 0.00 |
| pct_35_64 | 0 | 0.00 | 0.0 | 0.00 |
| pct_65_74 | 0 | 0.00 | 0.0 | 0.00 |
| pct_75 | 0 | 0.00 | 0.0 | 0.00 |
| male_pct | 0 | 0.00 | 0.0 | 0.00 |
| female_pct | 0 | 0.00 | 0.0 | 0.00 |
The descriptive statistics are identical, except that the original study seems to have rounded the COVID-19 statistics to zero decimal places.
The county-level Pearson’s rho correlation coefficient was used to test association between intra-categorical rates of disability and COVID-19 incidence rates. As this was a parametric test, normality should be tested. A separate hypothesis was formulated for disability in aggregate and for each sociodemographic disability characteristic.
Calculate Pearson’s R Correlation Coefficient of each independent variable and the COVID-19 incidence rate, reproducing the Pearson’s R column of original study Table 1.
| variable | r | t | p |
|---|---|---|---|
| dis_pct | -0.060 | 3.350 | 0.000 |
| white_pct | -0.332 | 19.612 | 0.000 |
| black_pct | 0.460 | 28.847 | 0.000 |
| native_pct | 0.019 | 1.072 | 0.142 |
| asian_pct | 0.094 | 5.272 | 0.000 |
| other_pct | 0.026 | 1.460 | 0.072 |
| non_hisp_white_pct | -0.361 | 21.545 | 0.000 |
| hisp_pct | 0.119 | 6.686 | 0.000 |
| non_hisp_non_white_pct | 0.442 | 27.429 | 0.000 |
| bpov_pct | 0.106 | 5.914 | 0.000 |
| apov_pct | -0.151 | 8.513 | 0.000 |
| pct_5_17 | 0.084 | 4.688 | 0.000 |
| pct_18_34 | 0.063 | 3.493 | 0.000 |
| pct_35_64 | -0.008 | 0.460 | 0.323 |
| pct_65_74 | -0.091 | 5.113 | 0.000 |
| pct_75 | -0.186 | 10.541 | 0.000 |
| male_pct | -0.134 | 7.519 | 0.000 |
| female_pct | 0.023 | 1.305 | 0.096 |
| pop | 0.128 | 7.215 | 0.000 |
| cases | 0.209 | 11.891 | 0.000 |
| x | 0.099 | 5.540 | 0.000 |
| y | -0.412 | 25.195 | 0.000 |
Compare the reproduced Pearson’s r correlation coefficients
to the original study’s Pearson’s r correlation coefficients.
Stars indicates the significance level with two stars for
p < 0.01 and one star for p < 0.05.
Correlation difference rp_r_diff is calculated between the
reproduction study rp_r and original study
or_r as rp_r_diff = rp_r - or_r Direction
difference rp_dir_diff is calculated as
(rp_r > 0) - (or_r > 0), giving 0 if
both coefficients have the same direction, 1 if the
reproduction is positive and the original is negative, and
-1 if the reproduction is negative but the original is
positive.
| Variable | R | Sig. Level | R | Sig. Level | R | Sig. Level | Direction |
|---|---|---|---|---|---|---|---|
| dis_pct | -0.056 | 2 | -0.060 | 2 | -0.004 | 0 | 0 |
| white_pct | -0.326 | 2 | -0.332 | 2 | -0.006 | 0 | 0 |
| black_pct | 0.456 | 2 | 0.460 | 2 | 0.004 | 0 | 0 |
| native_pct | 0.020 | 0 | 0.019 | 0 | -0.001 | 0 | 0 |
| asian_pct | 0.097 | 2 | 0.094 | 2 | -0.003 | 0 | 0 |
| other_pct | 0.028 | 0 | 0.026 | 0 | -0.002 | 0 | 0 |
| non_hisp_white_pct | -0.355 | 2 | -0.361 | 2 | -0.006 | 0 | 0 |
| hisp_pct | 0.119 | 2 | 0.119 | 2 | 0.000 | 0 | 0 |
| non_hisp_non_white_pct | 0.439 | 2 | 0.442 | 2 | 0.003 | 0 | 0 |
| bpov_pct | 0.108 | 2 | 0.106 | 2 | -0.002 | 0 | 0 |
| apov_pct | -0.146 | 2 | -0.151 | 2 | -0.005 | 0 | 0 |
| pct_5_17 | 0.083 | 2 | 0.084 | 2 | 0.001 | 0 | 0 |
| pct_18_34 | 0.066 | 1 | 0.063 | 2 | -0.003 | 1 | 0 |
| pct_35_64 | -0.005 | 0 | -0.008 | 0 | -0.003 | 0 | 0 |
| pct_65_74 | -0.089 | 2 | -0.091 | 2 | -0.002 | 0 | 0 |
| pct_75 | -0.181 | 2 | -0.186 | 2 | -0.005 | 0 | 0 |
| male_pct | -0.131 | 2 | -0.134 | 2 | -0.003 | 0 | 0 |
| female_pct | 0.028 | 0 | 0.023 | 0 | -0.005 | 0 | 0 |
Reproduction correlation coefficients varied slightly from the original study coefficients by +/- 0.006. All but one Pearson’s correlation coefficient was significant to the same level, and the exception was age 18 to 34. Counter-intuitively, the correlation coefficient was slightly closer to 0 but the p value was also found to be more significant, suggesting a difference in the estimation of t and/or p, or a typographical error. All of the coefficients had the same direction.
Unplanned Deviation for Reproduction: We should
expect identical results for this correlation test, so we loaded the
original author’s data from Aug1GEEdata.csv to re-test the
statistic, calculated as unplanned_r below.
| variable | unplanned_r | or_r | diff |
|---|---|---|---|
| dis_pct | -0.056 | -0.056 | 0 |
| white_pct | -0.326 | -0.326 | 0 |
| black_pct | 0.456 | 0.456 | 0 |
| native_pct | 0.020 | 0.020 | 0 |
| asian_pct | 0.097 | 0.097 | 0 |
| other_pct | 0.028 | 0.028 | 0 |
| non_hisp_white_pct | -0.355 | -0.355 | 0 |
| hisp_pct | 0.119 | 0.119 | 0 |
| non_hisp_non_white_pct | 0.439 | 0.439 | 0 |
| bpov_pct | 0.108 | 0.108 | 0 |
| apov_pct | -0.146 | -0.146 | 0 |
| pct_5_17 | 0.083 | 0.083 | 0 |
| pct_18_34 | 0.066 | 0.066 | 0 |
| pct_35_64 | -0.005 | -0.005 | 0 |
| pct_65_74 | -0.089 | -0.089 | 0 |
| pct_75 | -0.181 | -0.181 | 0 |
| male_pct | -0.131 | -0.131 | 0 |
| female_pct | 0.028 | 0.028 | 0 |
The author’s original data produced coefficients identical to the original publication! Is it possible that the data values are correct but have been reassigned / transposed to different counties?
Unplanned Deviation for Reproduction: Considering the precise bitwise reproduction of descriptive statistics and of correlation statistics from author-provided data, we decided to recalculate the COVID-19 incidence rate with author-provided case and population data for comparison to the author-provided incidence rate.
| FIPS | State | County | Population | Cases | OR_Incidence | RPr_Incidence | Difference |
|---|---|---|---|---|---|---|---|
| 1115 | Alabama | St. Clair | 88690 | 1151 | 1349.52 | 1297.78 | -51.74 |
| 1117 | Alabama | Shelby | 215707 | 2911 | 1297.78 | 1349.52 | 51.74 |
| 5123 | Arkansas | St. Francis | 25439 | 1112 | 704.16 | 4371.24 | 3667.08 |
| 5125 | Arkansas | Saline | 121421 | 855 | 397.33 | 704.16 | 306.83 |
| 5127 | Arkansas | Scott | 10319 | 41 | 314.15 | 397.33 | 83.18 |
| 5129 | Arkansas | Searcy | 7958 | 25 | 1322.08 | 314.15 | -1007.93 |
| 5131 | Arkansas | Sebastian | 127753 | 1689 | 5420.39 | 1322.08 | -4098.31 |
| 5133 | Arkansas | Sevier | 17139 | 929 | 570.08 | 5420.39 | 4850.31 |
| 5135 | Arkansas | Sharp | 17366 | 99 | 4371.24 | 570.08 | -3801.16 |
| 8039 | Colorado | Elbert | 26282 | 85 | 626.60 | 323.42 | -303.18 |
| 8041 | Colorado | El Paso | 713856 | 4473 | 323.42 | 626.60 | 303.18 |
| 8065 | Colorado | Lake | 7824 | 70 | 349.85 | 894.68 | 544.83 |
| 8067 | Colorado | La Plata | 56310 | 197 | 894.68 | 349.85 | -544.83 |
We found that 13 counties had incorrect COVID-19 incidence scores, and the scores seem to be transposed from other counties, such that the overall descriptive statistics were accurate but the correlation coefficients were inaccurate. This finding implies that subsequent analyses using the COVID-19 Incidence rate will be slightly different and more accurate in this reproduction study than in the original study.
Unplanned deviation for reproduction: Join the original author’s Incidence data into our reproduction data frame so that we can later test for sensitivity to this error. Then report any counties for which the reproduced COVID incidence rate differs from the original author’s COVID incidence rate.
| county_st | covid_rate | or_incidence |
|---|---|---|
| St. Clair County, Alabama | 1297.78 | 1349.52 |
| Shelby County, Alabama | 1349.52 | 1297.78 |
| St. Francis County, Arkansas | 4371.24 | 704.16 |
| Saline County, Arkansas | 704.16 | 397.33 |
| Scott County, Arkansas | 397.33 | 314.15 |
| Searcy County, Arkansas | 314.15 | 1322.08 |
| Sebastian County, Arkansas | 1322.08 | 5420.39 |
| Sevier County, Arkansas | 5420.39 | 570.08 |
| Sharp County, Arkansas | 570.08 | 4371.24 |
| Elbert County, Colorado | 323.42 | 626.60 |
| El Paso County, Colorado | 626.60 | 323.42 |
| Lake County, Colorado | 894.68 | 349.85 |
| La Plata County, Colorado | 349.85 | 894.68 |
The join worked, highlighting the same 13 counties with inconsistent incidence rates. This also confirms that our reproduced dependent variable is identical to the original dependent variable with the exception of these three counties.
Unplanned Deviation for Reproduction: The dependent and independent variables in this study do not have normal distributions, as shown in the Shapiro-Wilk test results above. Therefore, we deviate from the original study to use the Spearman’s Rho non-parametric correlation test.
Compare the Spearman’s rho correlation coefficients to the reproduced Pearson’s r correlation coefficients. Differences are calculated as Spearman’s Rho - Pearson’s R.
| Variable | R | Stars | Rho | Stars | Rho - R | Stars | Direction |
|---|---|---|---|---|---|---|---|
| dis_pct | -0.060 | 2 | -0.113 | 2 | -0.053 | 0 | 0 |
| white_pct | -0.332 | 2 | -0.421 | 2 | -0.089 | 0 | 0 |
| black_pct | 0.460 | 2 | 0.575 | 2 | 0.115 | 0 | 0 |
| native_pct | 0.019 | 0 | -0.084 | 2 | -0.103 | 2 | -1 |
| asian_pct | 0.094 | 2 | 0.194 | 2 | 0.100 | 0 | 0 |
| other_pct | 0.026 | 0 | 0.104 | 2 | 0.078 | 2 | 0 |
| non_hisp_white_pct | -0.361 | 2 | -0.454 | 2 | -0.093 | 0 | 0 |
| hisp_pct | 0.119 | 2 | 0.231 | 2 | 0.112 | 0 | 0 |
| non_hisp_non_white_pct | 0.442 | 2 | 0.481 | 2 | 0.039 | 0 | 0 |
| bpov_pct | 0.106 | 2 | 0.062 | 2 | -0.044 | 0 | 0 |
| apov_pct | -0.151 | 2 | -0.205 | 2 | -0.054 | 0 | 0 |
| pct_5_17 | 0.084 | 2 | 0.079 | 2 | -0.005 | 0 | 0 |
| pct_18_34 | 0.063 | 2 | 0.034 | 1 | -0.029 | -1 | 0 |
| pct_35_64 | -0.008 | 0 | -0.020 | 0 | -0.012 | 0 | 0 |
| pct_65_74 | -0.091 | 2 | -0.151 | 2 | -0.060 | 0 | 0 |
| pct_75 | -0.186 | 2 | -0.285 | 2 | -0.099 | 0 | 0 |
| male_pct | -0.134 | 2 | -0.201 | 2 | -0.067 | 0 | 0 |
| female_pct | 0.023 | 0 | -0.014 | 0 | -0.037 | 0 | -1 |
Three variables change significance levels, with Native American and Other races gaining significance and age 18-34 losing significance. Two correlations change direction, with both Native American race (illustrated in scatterplot below) and Female households switching from positive correlations to negative correlations. Instabilities between the parametric and non-parametic correlations arise from variables with very skewed distributions and/or weak correlations at the county level. Some difference may also be attributable to the 13 counties with data errors in the COVID-19 Incidence Rate. In such distributions, outlier observations have more weight in the parametric Person’s R test than in the non-parametric Spearman’s Rho test.
Although there were no major geographic transformations in
this study, geographic grouping criteria for the generalized
estimating equation (GEE) models are defined as a combination of states
and COVID-19 risk, which is based on the Kulldorff spatial scan
statistic for geographic clusters of high COVID-19 incidence. The scan
statistic in SaTScan used spherical great circle distance calculations
based upon the latitude and longitude coordinates of the centroid of
each county. For this purpose, we have used the X and
Y attributes provided as geographic variables with ACS
data.
We use a Kulldorff spatial scan statistic to detect spatial clusters of high COVID-19 incidence (workflow step 6). The statistic uses a Monte Carlo simulation to calculate statistical significance, and therefore may not produce identical results each time.
The original study uses SaTScan software to implement the Kulldorff
spatial scan statistic model. In SaTScan, models may be specified with
many parameters having significant implications for results. The
original manuscript only specifies that Poisson model should be used. We
can also intuit that the model is discrete (locations are stationary and
non-random), and spatial only (there is no temporal dimension). The
author-provided SaTScan results SatScan_results.txt
contains additional parameters which appear to adhere closely to the
software’s default settings. These include the maximum cluster size of
“50 percent of population at risk”, and the “GINI optimized cluster
collection” and “no geographical overlap” options for detecting
secondary clusters. The “P-value Cutoff” for significant clusters option
did not appear in the v9.6 output, suggesting that the software only
allowed the default “no” option for this at the time of the original
study.
SaTScan software can also output two versions of geographic data:
col cluster polygon shapefile contains a circle for
each cluster, where each polygon is a circle defined by the cluster
center and radius. The attributes include a variable
REL_RISK for cluster relative riskgis location point shapefile contains one point for
each county in a cluster. The attributes include variables
LOC_RR for local relative risk and CLU_RR for
cluster relative riskThe SaTScan software implementation of the Kulldorff spatial scan statistic calculates two relative risk scores for locations:
REL_RISK in the
col cluster polygon shapefile and as CLU_RR in
the gis location point shapefile.LOC_RR in the
gis location shapefile, and is not calculated in the
col cluster polygon shapefile.For the purposes of interpreting the spatial scan statistic, a location is a county centroid while a cluster is a collection of counties with high incidence rates, defined in the shape of a circle with a center location (a county centroid) and a radius.
The original study is not clear about using the cluster geographic
data vs the location geographic data or the cluster relative
risk vs local relative risk. However, The author-provided
SatScan_results.txt results file indicates a geographic
cluster file but no location file, and the author-provided
Aug1GEEdata.csv data table contains a REL_RISK
field but no CLU_RR field or LOC_RR field.
This suggests that in the original study, the col polygon
cluster shapefile and cluster relative risk were used to
represent COVID-19 risk and define GEE clusters.
The spatial scan statistic is based on case counts and total population, and is therefore unaffected by errors in the COVID Incidence rate.
Planned deviation for reproduction: We opted to use the SpatialEpi package in R, selecting open source software with R integration over SatSCan software, which is free but not open. The Kulldorff spatial scan statistic model in SpatialEpi also supports a discrete Poisson spatial model, and uses the GINI coefficient to select secondary clusters with no geographical overlap that maximize the difference between locations inside of clusters and locations outside of clusters. We expected that this set of software options could reproduce identical results compared to SaTScan.
First, calculate the Kulldorff spatial scan statistic using SpatialEpi. Optionally, skip this code block due to long run times of more than 10 minutes.
Load pre-calculated Kulldorff spatial scan results. Alternatively, skip or modify this code block to use your own version of the SpatialEpi Kulldorff results.
Report Kulldorff spatial scan results.
## [1] Most likely cluster:
## $location.IDs.included
## [1] 1824 1835 1797 1818 1825 1749 1854 1742 1837 1747 1838 280 1760 1846 1756
##
## $population
## [1] 16949211
##
## $number.of.cases
## [1] 469091
##
## $expected.cases
## [1] 233805.6
##
## $SMR
## [1] 2.006329
##
## $log.likelihood.ratio
## [1] 97983.07
##
## $monte.carlo.rank
## [1] 1
##
## $p.value
## [1] 0.001
## [1] Number of Secondary clusters: 136
The SpatialEpi implementation of Kulldorff spatial scan
statistics provides output in the form of hierarchical lists analogous
to the text output of SaTScan, but does not output a simple data frame
or tabular output analogous to the shapefiles from SaTScan. Therefore,
additional steps are required to append the Kulldorff scan results to
the acs_covid simple features data frame. This can be done
by assigning unique cluster ID’s to each county within a cluster.
Clusters include the county at the center of a cluster and all of the
other counties within the cluster radius. Therefore, we use the FIPS
code of the county at the center of each cluster as the unique cluster
ID.
Unplanned deviation for reproduction: The original study does not include visualizations of the spatial structure and distribution of COVID-19 clusters.
First, we must join the Kulldorff spatial scan cluster IDs to the acs_covid simple features dataframe. Although this was planned in workflow step 9, the order of operations between steps 9 and steps 7 and 8 is not important.
Next, calculate a new field isCluster to identify
counties in COVID-19 clusters. Additionally, distinguish between
counties defining the center of a cluster from counties constituting
other parts of a cluster by comparing the cluster ID (equivalent to the
center county’s fips code) to the county fips code.
Planned deviation for reproduction: Map the
SpatialEpi cluster results.
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: use 'fill' for the fill color of polygons/symbols
## (instead of 'col'), and 'col' for the outlines (instead of 'border.col').
## [tm_polygons()] Arguments `fill.values`, `values.na`, and `color.alpha`
## unknown.
## This message is displayed once every 8 hours.
Unplanned deviation for reproduction: The
SpatialEpi implementation of Kulldorff spatial scan
statistics does not calculate local relative risk or cluster relative
risk. Therefore, the next step is to calculate local and cluster
relative risk (workflow step 7).
Classify relative risk on a scale from 1 to 6 (workflow step 8). Risk is classified according to this table:
| Relative Risk Values | Relative Risk Class |
|---|---|
| Outside of cluster | 1 |
| RR < 1 | 1 |
| 1 <= RR < 2 | 2 |
| 2 <= RR < 3 | 3 |
| 3 <= RR < 4 | 4 |
| 4 <= RR < 5 | 5 |
| 5 <= RR < 6 | 6 |
Counties falling outside of any cluster are assigned a score of 1.
Unplanned deviation for reproduction: It would be helpful to visualize the spatial distributions of local relative risk classes and Kulldorff cluster relative risk classes in advance of using these classes to control for spatial heterogeneity in GEE models.
First, map the spatial distribution of local relative risk score classifications.
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "cat"`, use fill.scale =
## `tm_scale_categorical()`.
## ℹ Migrate the argument(s) 'palette' (rename to 'values'), 'labels' to
## 'tm_scale_categorical(<HERE>)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.
Next, map the cluster relative risk scores for comparison. Note that
following the original study classification methodology, counties
outside of clusters are assigned the lowest risk class of
1.
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "cat"`, use fill.scale =
## `tm_scale_categorical()`.
## ℹ Migrate the argument(s) 'palette' (rename to 'values'), 'labels' to
## 'tm_scale_categorical(<HERE>)'
## [v3->v4] `tm_polygons()`: use `col_alpha` instead of `border.alpha`.
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [cols4all] color palettes: use palettes from the R package cols4all. Run
## `cols4all::c4a_gui()` to explore them. The old palette name "Oranges" is named
## "brewer.oranges"
## Multiple palettes called "oranges" found: "brewer.oranges", "matplotlib.oranges". The first one, "brewer.oranges", is returned.
Comparing the cluster and local relative risk classifications for regions like the Southeast, it is apparent that some areas of high risk are represented with large clusters that have an averaging effect on the cluster-based relative risk score. This effect is more pronounced for clusters with low compactness (e.g. the Southeast cluster stretched over the “black belt” region from Louisiana and Arkansas to Georgia) than clusters with higher compactness (e.g. New York City) because the circular shape of clusters includes more low-risk counties.
The original study did not directly report any results from the
Kulldorff spatial scan statistic. However, the Kulldorff cluster
relative risk scores were combined with states to create clusters for
GEE models, hereafter called “GEE clusters”. The original study reported
102 unique GEE clusters having a range of 1 to
245 counties in each cluster.
In order to compare results, we first create cluster IDs as combinations of the state ID and COVID relative risk class. The first clustering ID (State) and second clustering score (COVID relative risk class) were combined to form IDs for each unique combination of state and relative risk class. Then, we find the number of unique clusters and frequency counties per cluster in our reproduction study for comparison to the original study.
## 111 unique clusters based on spatialEpi CLUSTER relative risk
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 7.00 27.56 50.50 159.00
We failed to reproduce the same configuration of GEE clusters as the original study, finding 9 more clusters than the original study and a much smaller maximum cluster of 159 counties compared to 245 counties.
Unplanned deviation for reproduction: Upon failing
to reproduce an identical number of GEE clusters using SpatialEpi in R,
we reproduced the procedure in the free but not open SaTScan software,
using the current software version 10.1. The input data files
(case, Coordinates.geo, and
Population.pop), and output data files
(sat_scan_rpr.txt, sat_scan_rpr.col.shp, and
sat_scan_rpr.gis.shp) are found in the
data/derived/public/satscan directory. The
sat_scan_rpr.txt file reports the model parameters used in
addition to results.
Although it is not ideal to intercede with this unplanned deviation at this step, is the first step in the methodology following the Kulldorff spatial scan statistic with a result reported in the original publication.
First, load and verify whether our SaTScan reproduction data compares to the author-provided SaTScan data.
## 96 reproduced relative risk observations
## 96 author-provided relative risk observations
## 96 reproduced relative risk values match the original author's relative risk values
Our SaTScan results exactly reproduced the author-provided SaTScan results data.
Join the SaTScan results to acs_covid for mapping and
analysis.
## Joining 96 records with 96 unique LOC_ID county values
Unplanned deviation for reproduction: Visualize the spatial distribution of the author-provided Kulldorff COVID-19 Clusters.
##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [tm_polygons()] Arguments `fill.values` and `value.na` unknown.
In the map above, clusters containing only one county have no visible circle. Clusters containing two counties are encircled, but have no label. Clusters containing three or more counties are encircled and labelled with the number of counties.
Note that this version of data only includes the 96 counties defining cluster centers, visualized with fill colors above. The data excludes all of the non-center counties in clusters with more than one county. The extent of these larger clusters is visualized by unfilled circles defined by cluster radii.
Additionally, the SaTScan software confusingly merges two sets of
clusters in the results when the user uses the (default) option for
GINI-optimized clusters. One set of results is a hierarchical
non-overlapping set of clusters. These clusters are noted with
GINI_CLUST = F in the results. The second set of results is
a set of hierarchical non-overlapping clusters designed to maximize the
GINI coefficient of inequality between counties within clusters and
counties outside of clusters. These clusters are noted with
GINI_CLUST = T in the results.
Merged together as they are, the two sets of secondary clusters overlap one another geographically, causing ambiguity in terms of which cluster-based relative risk score should be used at each location.
Unplanned deviation for reproduction: Can we also use these reproduced SaTScan results to exactly reproduce the author-reported frequency of original GEE classes and maximum counties per class? If the results match, it will confirm that the problems identified above have propagated through the original study analysis.
## 102 unique clusters based on spatialEpi CLUSTER relative risk
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 3.00 29.99 58.75 245.00
Using SaTScan Kulldorff clusters, we have exactly reproduced the author-reported frequency of original GEE classes and maximum counties per class. We have confirmed that the original study used the cluster relative risk of the center county of each cluster, including both the hierarchical and GINI-optimized sets of clusters.
Unplanned Deviation for Reanalysis: At this point it
is clear that the best decision will be to shift from a
reproduction study to a reanalysis study,
intentionally altering methodological decisions to achieve a more valid
outcome. We prefer to include all counties contained in each
cluster, and to use only one set of non-overlapping clusters,
as produced by the SpatialEpi algorithm.
Given the shifting goal, how sensitive is this study to the choice of computational environment for the Kulldorff scan statistics? To answer this question, we must load the local SaTSCan results inclusive of all counties within clusters, filter the results to focus on the standard hierarchical set of clusters, and compare the spatial distributions of the SaTScan and SpatialEpi results.
## SaTScan combined GIS output has 1306 records with 922 unique county values
## SaTScan Hierarchical clusters include 605 records with 605 unique county values
## SaTScan GINI-optimized clusters include 701 records with 701 unique county values
It was necessary to divide the Hierarchical clusters from the GINI clusters to avoid duplicates and geographic overlap.
Compare the SaTScan Hierarchical clusters to the SpatialEpi clusters.
The two methods only agree on the definition of the largest clusters in distant regions. Thereafter, SpatialEpi detects many secondary clusters in the vicinity of the largest ones, while SaTScan detects seven isolated and low-probability counties.
Compare the SaTScan GINI Optimized clusters to the SpatialEpi clusters.
There is more agreement overall between SpatialEpi and SaTScan GINI Optimized clusters. The two algorithms agree the most for smaller and less significant clusters above the 95% confidence threshold. Because the SaTScan clusters are more limited in size, SaTScan detects several smaller clusters with gaps in place of the largest SpatialEpi clusters.
Keeping in mind that the final analysis uses a classification of cluster relative risk for GEE models, are there important differences between the two results with regard to classification of risk? We can check by calculating cluster relative risk classes based on the SaTScan GINI clusters, and cross-tabulating with the SpatialEpi risk classes.
| 1 | 2 | 3 | 4 | 5 | 6 | |
|---|---|---|---|---|---|---|
| 1 | 2090 | 78 | 0 | 0 | 0 | 0 |
| 2 | 308 | 528 | 9 | 0 | 0 | 0 |
| 3 | 7 | 6 | 69 | 1 | 0 | 0 |
| 4 | 1 | 3 | 0 | 5 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 1 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 1 |
Indeed, SpatialEpi has identified more than 300 counties with above normal risk that were not identified by SaTScan. Meanwhile, SaTScan identified 78 counties with above normal risk that were not identified by SpatialEpi.
The maps and crosstabulation above indicate that there are important differences between the SaTScan and SpatialEpi computational environments for calculating secondary clusters.
We summarize our understanding of the computational differences for default settings below, based on close examination of our software outputs, technical documentation for SaTScan, and the documentation and code repository for SpatialEpi.
| SaTScan Hierarchical | SaTScan GINI | SpatialEpi | |
|---|---|---|---|
| possible shapes | circle (default) or ellipse | circle (default) or ellipse | circle |
| possible cluster centers | locations with rates > normal | locations with rates > normal | all locations |
| maximum cluster size | 50% of cases | varies, not exceeding 50% of cases | 50% of population |
| maximum p of cluster | 1.00 | 1.00 | 0.05 |
| distance | spherical great circle | spherical great circle | spherical equidistant cylindrical projection |
To further interrogate the differences in sets of secondary clusters, we must understand that theoretically each location (county), may be the center of many different circular clusters defined by different radii, starting with a radius of 0 and the one county at the center, and expanding until the maximum cluster size is reached.
Comparison of the results from SaTScan and SpatialEpi
Kulldorf analyses show that the two methods were similar, with the large
majority of RR classes overlapping. Most values that did not map
directly were only one class away (mostly just between risk classes 1
and 2). Analysis of the overlaid map shows that the Kulldorf analysis
using SpatialEpi yielded larger contiguous risk areas,
particularly in Florida, southern California and the southwest, and in
southern Texas. SatScan yeilded more risk clusters in the northeast,
northern Texas, and the four corners region.
Results can be interpreted in two ways. First, results from the
Kulldorf spatial scan can be compared to the SaTScan results loaded in
as a representation of the original study methods and result. If these
two figures are similar, it indicates that the reproducible code using
the SpatialEpi package that generated the figures is
accurate in reproducing the study. Secondly, given our hypothesis
surrounding possible clustering along urban cooridors, a result in which
Kulldorf results show more clustered clusters would indicate support for
the hypothesis.
Results indicated that the two methods were similar enough to
succesfully reproduce the study. Additionally, they support the
hypothesis predicting that SpatialEpi would better predict
clustered urban areas.
Having an open-source version of the study that determines links between disability, social vulnerability, and relative risk could be an incredibly useful tool in future epidemiological applications. As the frequency of epidemics and pandemics increases and social vulnerability categories become increasingly enshrined, methods like these could be critical to quickly identify at-risk groups and respond earlier to serious challenges.
The authors of this preregistration state that they completed this preregistration to the best of their knowledge and that no other preregistration exists pertaining to the same hypotheses and research.
This report is based upon the template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences, DOI:[10.17605/OSF.IO/W29MQ](https://doi.org/10.17605/OSF.IO/W29MQ)
Reproduction of Chakraborty 2021: An intracategorical analysis of COVID-19 and people with disabilities Joseph Holler, Junyi Zhou, Peter Kedron, Drew An-Pham, Derrick Burt Version 2.0 | First Created July 7, 2021 Updated June 26, 2023 https://github.com/josephholler/RPr-Chakraborty-2021